After the variant call process the VCF file generated contains some artefacts that affect downstream analysis. Alignment artifacts can occur whenever there is sufficient sequence similarity between two or more regions in the genome to confuse the alignment algorithm. This can occur when the aligner for whatever reason overestimate how uniquely a read maps, thereby assigning it too high of a mapping quality. It can also occur through no fault of the aligner due to gaps in the reference, which can also hide the true position to which a read should map. Even thought GATK solve most of these artifacts some still persists. In this tutorial we will filter out artifacts from our variant calls and generate a ‘cleaner’ set of loci for downstream analysis.
For this step, the .vcf file is still too large to be
manipulated in our personal computer, for that reason this step had been
P. vivax has several genomic regions like the vir genes that
are hard to align because they
For this first step we require the following files:
.vcf.gz or .vcf files: If a
.vcf.gz is used, a .vcf.tbi will be also
required. For this tutorial the .vfc.gz file that we are
going to work with is:
ColombiaGates_Pviv.JointCall.filtered.combined.snpeff.vcf.gz.
.gff: the gff file of the reference genome will be
also required in order to filter coding regions.
noncore_regions.bed: This file will be used to
remove non-core regions that are difficult to sequence from the
analysis.
load_libraries.R: This R script file load all the
required packages that we need to perform the analysis.
functions.R: This R script file load all the
functions that we have created to perform the analysis. Within those
functions we a function named fx_run_vcftools that allow us
to manipulate the vcftools package from the R environment.
We have also created other functions to get information from the
.vcf file with loading the whole file in R environment.
This functions will help us the get information regarding to the name of
the samples and create a samples.indv to use as a filter in
the analysis. Other functions will be explained later in this
tutorial.
Filtering_vcf_file.R: This R script file contains
the instructions to perform the analysis in R. This script requires 9
arguments:
wd: Working directory where all temporal files and
final outputs will be stored.
fd: Directory where the pre-required files are
stored.
gzvcf: Path to the .vcf.gz
file.
o: prefix used for the temporal and final
outputs.
rkeep: regular expression to identify samples of
interest.
ebed: name of the noncore_regions.bed
file. This file should be stored in the fd path.
gff: name of the .gff file. This file
should be stored in the fd path.
n: number of iterations. In the final step of this
script we are going to transform our data from VCF format to rGenome
format. This step is computing exhausting and consumes RAM memory. For
that reason we subdivide this step in n instances to be
able to run in the server.
thres: Minimum read depth to call an allele in a
sample.
r_options.sh: bash script with the instructions to
run R. This file should be stored in wd. The scrip should
looks as follow:
##!/bin/bash
# source /broad/software/scripts/useuse
# use R-4.1
# Rscript /gsap/garage-protistvector/ColombiaData/Pviv/AnalysisTools/Filtering_vcf_file.R \
# -wd /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/ \
# -fd /gsap/garage-protistvector/ColombiaData/Pviv/AnalysisTools \
# -gzvcf /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/ColombiaGates_Pviv.JointCall.filtered.combined.snpeff.vcf.gz \
# -o ColombiaGates \
# -rkeep (Col|SP) \
# -ebed Pvivax_P01_noncore.bed \
# -gff genes.gff \
# -n 500 \
# -thres 5
uger_options.sh: bash script with the instruction to
run UGER. This file should be stored in wd. The scrip
should looks as follow:##!/bin/bash
#qsub -l h_vmem=32G \
# -l h_rt=01:00:00 \
# -o /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/output/ \
# /gsap/garage-protistvector/ColombiaData/Pviv/DataGeneration/post_vcall_analysis/r_options.sh
As a final output this step will generate and .RData
file containing the data in rGenome format.
setwd('~/Documents/Github/Plasmodium_WGS_analysis/Pre_filtering/')
source('../functions_libraries/load_libraries.R')
for(file in list.files('PerVen/Pfal/', pattern = '.RData')){
load(file.path('PerVen/Pfal/', file))
rm(list = ls()[!grepl('Pf3D7_(\\d+|MIT|API)_v3_rGenome_object', ls())])
}
rGenome_objects = list(Pf3D7_01_v3_rGenome_object,
Pf3D7_02_v3_rGenome_object,
Pf3D7_03_v3_rGenome_object,
Pf3D7_04_v3_rGenome_object,
Pf3D7_05_v3_rGenome_object,
Pf3D7_06_v3_rGenome_object,
Pf3D7_07_v3_rGenome_object,
Pf3D7_08_v3_rGenome_object,
Pf3D7_09_v3_rGenome_object,
Pf3D7_10_v3_rGenome_object,
Pf3D7_11_v3_rGenome_object,
Pf3D7_12_v3_rGenome_object,
Pf3D7_13_v3_rGenome_object,
Pf3D7_14_v3_rGenome_object)
rm(list = c('Pf3D7_01_v3_rGenome_object',
'Pf3D7_02_v3_rGenome_object',
'Pf3D7_03_v3_rGenome_object',
'Pf3D7_04_v3_rGenome_object',
'Pf3D7_05_v3_rGenome_object',
'Pf3D7_06_v3_rGenome_object',
'Pf3D7_07_v3_rGenome_object',
'Pf3D7_08_v3_rGenome_object',
'Pf3D7_09_v3_rGenome_object',
'Pf3D7_10_v3_rGenome_object',
'Pf3D7_11_v3_rGenome_object',
'Pf3D7_12_v3_rGenome_object',
'Pf3D7_13_v3_rGenome_object',
'Pf3D7_14_v3_rGenome_object'))
source('../functions_libraries/functions.R')
#sourceCpp('../functions_libraries/Rcpp_functions.cpp')
Pfal_rGenome_object = rGenome()
for(obj in 1:length(rGenome_objects)){
Pfal_rGenome_object@data$gt = rbind(Pfal_rGenome_object@data$gt, rGenome_objects[[obj]]@data$gt)
Pfal_rGenome_object@data$loci_table = rbind(Pfal_rGenome_object@data$loci_table, rGenome_objects[[obj]]@data$loci_table)
}
Pfal_rGenome_object@data$metadata = rbind(Pfal_rGenome_object@data$metadata, rGenome_objects[[obj]]@data$metadata)
rm(list = c('rGenome_objects', 'obj'))
# There are some spelling mistakes in the sample codes of the sequencing files, The table VENPER2019_2022.csv match the incorrect codes with the correct codes
external_metadata = read.csv('../GeneticDiversity/merged_metadata.csv')
# PerVen_seq_codes = read.csv('PerVen/VENPER2019_2022.csv')
# Remove Duplicated records
external_metadata = external_metadata[!duplicated(external_metadata$Sample_id),]
# PerVen_seq_codes = PerVen_seq_codes[!duplicated(PerVen_seq_codes$PreferedSampleID),]
# merge incorrect and correcte codes with metadata
# external_metadata = merge(external_metadata, PerVen_seq_codes, by.x = 'Sample_id', by.y = 'AlternateSampleID', all.x = T)
# external_metadata %<>% mutate(PreferedSampleID = case_when(
# is.na(PreferedSampleID) ~ Sample_id,
# !is.na(PreferedSampleID) ~ PreferedSampleID
# ))
# Add metadata to the rGenome object
Pfal_rGenome_object@data$metadata = merge(Pfal_rGenome_object@data$metadata, external_metadata[,c("Sample_id", "Study", "Country", 'Date_of_Collection', "Year_of_Collection",'Subnational_level0', 'Subnational_level1', "Subnational_level2")], by.x = 'Sample_id', by.y = 'Sample_id', all.x = T)
# Sorting samples
rownames(Pfal_rGenome_object@data$metadata) =
Pfal_rGenome_object@data$metadata$Sample_id
Pfal_rGenome_object@data$metadata =
Pfal_rGenome_object@data$metadata[colnames(Pfal_rGenome_object@data$gt),]
3.1 Sample amplification rate
Pfal_rGenome_object = SampleAmplRate(Pfal_rGenome_object)
3.2 Check the distribution of the amplification rate of the samples using a histogram
Pfal_rGenome_object@data$metadata %>% ggplot(aes(x = SampleAmplRate, fill = Country))+
geom_histogram(position = 'stack', binwidth = .05, color = 'gray40')+
theme_bw()+
labs(title = 'Sample Amplification Rate distribution',
x = 'Amplification Rate')
3.3 Remove samples with less than 75% of amplified loci (> 25% of missing data)
Pfal_rGenome_object = filter_samples(obj = Pfal_rGenome_object, v =
Pfal_rGenome_object@data$metadata$SampleAmplRate >= .75)
4.1 Update allele counts
Pfal_rGenome_object@data$loci_table = cbind(Pfal_rGenome_object@data$loci_table,
get_AC(obj = Pfal_rGenome_object))
4.2 Filter monomorphic sites
Pfal_rGenome_object = filter_loci(Pfal_rGenome_object,
v = Pfal_rGenome_object@data$loci_table$Cardinality > 1)
5.1 Calculate the amplification rate of each loci
Pfal_rGenome_object = LocusAmplRate(Pfal_rGenome_object)
5.2 Check the distribution of the amplification rate of the samples using a histogram
Pfal_rGenome_object@data$loci_table %>%
mutate(CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = LocusAmplRate)) +
geom_point(alpha = 0.7, size = .25) +
facet_grid(.~ CHROM2)+
theme_bw()+
labs(y = 'Ampl. Rate', x = 'Chromosomal position')+
theme(legend.position = 'right',
axis.text.x = element_blank())
5.3 Filter loci with amplification rate below .75
Pfal_rGenome_object = filter_loci(Pfal_rGenome_object, v = Pfal_rGenome_object@data$loci_table$LocusAmplRate >= .75)
Pfal_rGenome_object@data$loci_table$ALT =
ifelse(Pfal_rGenome_object@data$loci_table$REF !=
gsub(',([ATCG]|\\*)+',
'',
gsub(':\\d+',
'',
Pfal_rGenome_object@data$loci_table$Alleles)),
gsub(':\\d+',
'',
Pfal_rGenome_object@data$loci_table$Alleles),
gsub('^([ATCG]|\\*)+,',
'',
gsub(':\\d+', '', Pfal_rGenome_object@data$loci_table$Alleles))
)
Pfal_rGenome_object@data$loci_table$TypeOf_Markers =
TypeOf_Marker(Pfal_rGenome_object, w = 1, n = 1)
Pfal_rGenome_object@data$loci_table$ObsHet = get_ObsHet(Pfal_rGenome_object, by = 'loci', w = 1, n = 1)
Pfal_rGenome_object@data$loci_table$frac_ofHet_pAlts =
frac_ofHet_pAlt(Pfal_rGenome_object, w = 1, n = 1)
10.1 Distribution of the fraction of alternative alleles in polyclonal samples
Pfal_rGenome_object@data$loci_table %>% ggplot(aes(x = frac_ofHet_pAlts))+
geom_histogram(binwidth = 0.01)+
labs(x = 'Fraction of heterozygous samples per alternative allele per site',
y = 'Number of sites (Loci)')+
theme_bw()
10.2 Classify sites based on the proportion of alternative alleles in
polyclonal samples per site
Pfal_rGenome_object@data$loci_table %<>% mutate(ALT_FILTER =case_when(
frac_ofHet_pAlts == 1 ~ '100%',
frac_ofHet_pAlts < 1 & frac_ofHet_pAlts > .5 ~ '50 - 99%',
frac_ofHet_pAlts <= .5 ~ '<=50%',
))
10.3 Distribution of observed heterozygosity per locus, by type of marker, by group (defined in previuos step 10.2)
Pfal_rGenome_object@data$loci_table %>%
ggplot(aes(x = ObsHet,
fill = factor(ALT_FILTER,
levels = c('<=50%', '50 - 99%', '100%'))))+
geom_histogram(binwidth = .01)+
scale_fill_manual(values = c('dodgerblue3', 'gold3', 'firebrick3'))+
facet_wrap(.~factor(TypeOf_Markers, levels = c(
'SNP',
'INDEL',
'INDEL:Homopolymer',
'INDEL:Dinucleotide_STR',
'INDEL:Trinucleotide_STR',
'INDEL:Tetranucleotide_STR',
'INDEL:Pentanucleotide_STR',
'INDEL:Hexanucleotide_STR'
)), scales = 'free_y', ncol = 4)+
labs(y = 'Number of Loci',
x = 'Observed Heterozygosity per locus',
fill = 'Het/Alt')+
theme_bw()
Pfal_rGenome_object = filter_loci(Pfal_rGenome_object,
v = !(Pfal_rGenome_object@data$loci_table$TypeOf_Markers %in%
c('INDEL:Homopolymer', 'INDEL:Dinucleotide_STR')))
# In that way we identify that maximum fraction of Heterozygous samples per loci is 0.4855491
Pfal_rGenome_object@data$loci_table %>%
mutate(TypeOf_Markers2 = case_when(
TypeOf_Markers == 'SNP' ~ 'SNP',
TypeOf_Markers != 'SNP' ~ 'INDELs'),
CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = ObsHet, color = factor(ALT_FILTER, levels = c('<=50%', '50 - 99%', '100%')))) +
geom_point(alpha = 0.5, size = .25) +
scale_color_manual(values = c('dodgerblue3', 'gold3', 'firebrick3'))+
facet_grid(TypeOf_Markers2 ~ CHROM2)+
theme_bw()+
labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'Het/Alt')+
theme(legend.position = 'right',
axis.text.x = element_blank(),
axis.title.x = element_blank())
Pfal_rGenome_object@data$loci_table = mean_ObsHet(Pfal_rGenome_object, gff = '../reference/PlasmoDB-59_Pfalciparum3D7.gff')
ObsHet_Threshold = quantile(Pfal_rGenome_object@data$loci_table %>%
filter(ALT_FILTER == '<=50%', TypeOf_Markers == 'SNP') %>%
group_by(gene_id) %>%
dplyr::summarise(mean_ObsHet = max(mean_ObsHet)) %>%
select(mean_ObsHet) %>%
unlist, .95)
Pfal_rGenome_object@data$loci_table %>%
filter(ALT_FILTER == '<=50%', TypeOf_Markers == 'SNP') %>%
group_by(gene_id) %>%
dplyr::summarise(mean_ObsHet = max(mean_ObsHet)) %>%
ggplot(aes(x = mean_ObsHet))+
geom_histogram(binwidth = .005)+
geom_vline(xintercept = ObsHet_Threshold)+
labs(y = 'Number of genomic regions', x = 'Average Observed Heteozygosity in SNPs') +
theme_bw()
Pfal_rGenome_object@data$loci_table %<>% mutate(ObsHet_Filter = case_when(
mean_ObsHet > ObsHet_Threshold ~ 'OUT',
mean_ObsHet <= ObsHet_Threshold ~ 'PASS'
))
Pfal_rGenome_object@data$loci_table %>%
mutate(TypeOf_Markers2 = case_when(
TypeOf_Markers == 'SNP' ~ 'SNP',
TypeOf_Markers != 'SNP' ~ 'INDEL'),
CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = ObsHet, color = ObsHet_Filter)) +
geom_point(alpha = 0.5, size = .25) +
geom_hline(yintercept = ObsHet_Threshold)+
scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
facet_grid(TypeOf_Markers2 ~ CHROM2)+
theme_bw()+
labs(y = 'Observed Heterozygosity',
x = 'Chromosomal position',
color = 'Mean ObsHet > Th')+
theme(legend.position = 'right',
axis.text.x = element_blank())
Pfal_rGenome_object@data$loci_table = SNP_density(Pfal_rGenome_object, gff = '../reference/PlasmoDB-59_Pfalciparum3D7.gff')
SNP_density_threshold = quantile(Pfal_rGenome_object@data$loci_table %>%
filter(TypeOf_Markers == 'SNP') %>%
group_by(gene_id) %>%
dplyr::summarise(SNP_density = max(SNP_density)) %>%
select(SNP_density) %>%
unlist, .95)
Pfal_rGenome_object@data$loci_table %>%
filter(TypeOf_Markers == 'SNP') %>%
group_by(gene_id) %>%
dplyr::summarise(SNP_density = max(SNP_density)) %>%
ggplot(aes(x = SNP_density))+
geom_histogram(binwidth = .001)+
geom_vline(xintercept = SNP_density_threshold)+
labs(y = 'Number of genomic regions', x = 'SNP density') +
theme_bw()
Pfal_rGenome_object@data$loci_table %<>%
mutate(SNP_density_Filter = case_when(
SNP_density > SNP_density_threshold ~ 'OUT',
SNP_density <= SNP_density_threshold ~ 'PASS'
))
Pfal_rGenome_object@data$loci_table %>%
mutate(TypeOf_Markers2 = case_when(
TypeOf_Markers == 'SNP' ~ 'SNP',
TypeOf_Markers != 'SNP' ~ 'INDEL'),
CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = ObsHet, color = SNP_density_Filter)) +
geom_point(alpha = 0.5, size = .25) +
scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
facet_grid(TypeOf_Markers2 ~ CHROM2)+
theme_bw()+
labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'SNP density')+
theme(legend.position = 'right',
axis.text.x = element_blank())
Pfal_rGenome_object@data$loci_table %<>%
mutate(Alignment_Filter = case_when(
ObsHet_Filter == 'OUT' & SNP_density_Filter == 'OUT' ~ 'OUT',
!(ObsHet_Filter == 'OUT' & SNP_density_Filter == 'OUT') ~ 'PASS'
))
Pfal_rGenome_object@data$loci_table %>%
mutate(TypeOf_Markers2 = case_when(
TypeOf_Markers == 'SNP' ~ 'SNP',
TypeOf_Markers != 'SNP' ~ 'INDEL'),
CHROM2 = gsub('(^(\\d|P|v)+_|_v1)', '',CHROM))%>%
ggplot(aes(x = POS, y = ObsHet, color = Alignment_Filter)) +
geom_point(alpha = 0.5, size = .25) +
scale_color_manual(values = c('firebrick2', 'dodgerblue3'))+
facet_grid(TypeOf_Markers2 ~ CHROM2)+
theme_bw()+
labs(y = 'Observed Heterozygosity', x = 'Chromosomal position', color = 'Alignment\nFilter')+
theme(legend.position = 'right',
axis.text.x = element_blank())
Pfal_rGenome_object@data$loci_table %>%
filter(Alignment_Filter == 'OUT') %>%
select(gene_id) %>%
unique()
## gene_id
## Pf3D7_01_v3_542041 PF3D7_0114000
## Pf3D7_02_v3_863081 PF3D7_0221500
## Pf3D7_03_v3_259963 PF3D7_0305400
## Pf3D7_03_v3_1007410 PF3D7_0324100
## Pf3D7_04_v3_38961 PF3D7_0400200
## Pf3D7_04_v3_63131 PF3D7_0400700
## Pf3D7_04_v3_89170 PF3D7_0401500
## Pf3D7_04_v3_156081 PF3D7_0402500
## Pf3D7_04_v3_1136672 PF3D7_0425200
## Pf3D7_05_v3_33564 PF3D7_0500400
## Pf3D7_06_v3_26967 PF3D7_0600500
## Pf3D7_06_v3_722448 PF3D7_0617400
## Pf3D7_07_v3_70826 PF3D7_0701600
## Pf3D7_07_v3_525381 PF3D7_0711900
## Pf3D7_07_v3_562933 PF3D7_0712500
## Pf3D7_07_v3_566727 PF3D7_0712600
## Pf3D7_08_v3_49758 PF3D7_0800300
## Pf3D7_08_v3_1420832 PF3D7_0833100
## Pf3D7_09_v3_64259 PF3D7_0901200
## Pf3D7_09_v3_1010517 PF3D7_0924900
## Pf3D7_11_v3_52379 PF3D7_1100500
## Pf3D7_12_v3_44423 PF3D7_1200500
## Pf3D7_12_v3_767122 PF3D7_1219300
## Pf3D7_12_v3_1715854 PF3D7_1240500
## Pf3D7_14_v3_30103 PF3D7_1400900
## Pf3D7_14_v3_1375581 PF3D7_1434400
Pfal_rGenome_object =
filter_loci(Pfal_rGenome_object,
v = Pfal_rGenome_object@data$loci_table$Alignment_Filter == 'PASS')
Pfal_rGenome_object@data$metadata$fracHet =
get_ObsHet(obj = Pfal_rGenome_object, by = 'sample')
Pfal_rGenome_object@data$metadata$Fws =
get_Fws(obj = Pfal_rGenome_object, w = 1, n = 1)
Pfal_rGenome_object@data$metadata %>%
ggplot(aes(x = fracHet, y = Fws))+
geom_point()+
geom_hline(yintercept = .97)+
geom_hline(yintercept = .88)+
geom_vline(xintercept = .015)+
geom_vline(xintercept = .06)+
facet_grid(.~Country)+
labs(x = 'Fraction of heterozygous loci')+
theme_bw()
Pfal_rGenome_object@data$metadata %<>%
mutate(Clonality = case_when(
Fws >= .97 ~ 'Monoclonal',
Fws < .97 & Fws >= .88 ~ 'Highly related Polyclonal',
Fws < .88 ~ 'Polyclonal'
))
rm(list = ls()[-grep('Pfal_rGenome_object',ls())])
save.image('../GeneticDiversity/PerVen_Pviv_filtered_rGenome.RData')